{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Encrypted Linear Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial you are going to see how you can run a linear regression model on **data distributed in a pool of workers** with **encrypted computations leveraged by Secured Multi-Party Computation**. For this demonstration we are going to use the classical Housing Prices dataset that is already available in the VirtualGrid set up by the Syft Sandbox.\n",
    "\n",
    "The idea for the implementation of the Encrypted Linear Regression algorithm in PySyft is based on the section 2 of [this paper](https://arxiv.org/abs/1901.09531) written by Jonathan Bloom of the Broad Institute of MIT and Harvard.\n",
    "\n",
    "**Author**: André Macedo Farias. Github: [@andrelmfarias](https://github.com/andrelmfarias) | Twitter: [@andrelmfarias](https://twitter.com/andrelmfarias)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Preliminaries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's import PySyft and PyTorch and set up the Syft sandbox, which will create all the objects and tools we will need to run our simulation (Virtual Workers, VirtualGrid with datasets, etc...)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import syft as sy\n",
    "sy.create_sandbox(globals(), verbose=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that we have several workers already set up:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "workers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And each one has a chunk of the Housing Prices dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for worker in workers:\n",
    "    print(worker.search([\"#housing\", \"#data\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Encrypted Linear Regression with PySyft "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Loading Housing Prices data from Virtual Grid"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have our Syft environment set, let's load the data.\n",
    "\n",
    "Please note that in order to avoid overflow with the SMPC computations performed by the linear model, and to maintain its stability, **we need to scale the data in a such way that the magnitude of each coordinate average lies in the interval [0.1, 10]**.\n",
    "\n",
    "Usually that can be done without revealing the data or the averages, you only need to have an idea of the order of magnitude. For example, if one of the coordinate is the surface of the house and it is represented in m², you should scale it by dividing by 100, as we know the surfaces of houses have an order of magnitude close to 100 in average.\n",
    "\n",
    "After running the model and obtaining the main statistics, we can rescale them back if needed. The same can be done with predictions.\n",
    "\n",
    "In this tutorial I will be loading the data and scale them following this idea:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scale_data = torch.Tensor([10., 10.,  10., 1., 1., 10., 100., 10., 10., 1000., 10., 1000., 10.])\n",
    "scale_target = 100.0\n",
    "\n",
    "housing_data = []\n",
    "housing_targets = []\n",
    "for worker in workers:\n",
    "    housing_data.append(worker.search([\"#housing\", \"#data\"])[0] / scale_data.send(worker))\n",
    "    housing_targets.append(worker.search([\"#housing\", \"#target\"])[0] / scale_target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Setting up 2 more Virtual workers: the crypto provider and the \"honest but curious\" worker"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to run the linear regression, we will need **two more workers**, a *crypto provider* and a *honest but curious* worker. Both are necessary to assure the security of the SMPC computations when we run the model in a pool with more than 3 workers.\n",
    "\n",
    "> *Note: the **honest but curious** worker is a legitimate participant in a communication protocol who will not deviate from the defined protocol but will attempt to learn all possible information from legitimately received messages.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "crypto_prov = sy.VirtualWorker(hook, id=\"crypto_prov\")\n",
    "hbc_worker = sy.VirtualWorker(hook, id=\"hbc_worker\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Running Encrypted Linear Regression with SMPC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's import the EncryptedLinearRegression from the linalg module of pysyft:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from syft.frameworks.torch.linalg import EncryptedLinearRegression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's train the model!!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "crypto_lr = EncryptedLinearRegression(crypto_provider=crypto_prov, hbc_worker=hbc_worker)\n",
    "crypto_lr.fit(housing_data, housing_targets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can display the results with the method `.summarize()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "crypto_lr.summarize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**We can see that the EncryptedLinearRegression does not only give the coefficients and intercept values, but also their standard errors and the p-values!**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Comparing results with other linear regressors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, in order to show the effectiveness of the EncryptedLinearRegression, let's compare it with the Linear Regression from other known libraries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 Sending data to local server for comparison purposes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's send the data to the local worker and transform the `torch.Tensor`s in `numpy.array`s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "data_tensors = [x.copy().get() for x in housing_data] \n",
    "target_tensors = [y.copy().get() for y in housing_targets]\n",
    "\n",
    "data_np = torch.cat(data_tensors, dim=0).numpy()\n",
    "target_np = torch.cat(target_tensors, dim=0).numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Scikit-learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First let's compare the results with the sklearn's Linear Regression:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "lr = LinearRegression().fit(data_np, target_np.squeeze())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display the results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 25)\n",
    "print(\"Sklearn Linear Regression\")\n",
    "print(\"=\" * 25)\n",
    "for i, coef in enumerate(lr.coef_, 1):\n",
    "    print(\" coeff{:<3d}\".format(i), \"{:>14.4f}\".format(coef))\n",
    "print(\" intercept:\", \"{:>12.4f}\".format(lr.intercept_))\n",
    "print(\"=\" * 25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**You can notice that the are results are pretty much the same!! The are some small differences, but they are never higher than 0.2% of the value computed by the sklearn model!!**\n",
    "\n",
    "**For an ecrypted model that can compute linear regression coefficients without ever revealing the data, this is a huge achievement!**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 Statsmodel API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can do the same using the Linear Regression from Statsmodel API, which also gives us the **standard errors** and **p-values** of the coefficients. We can then compare it with the results given by the EncryptedLinearRegression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.api as sm\n",
    "mod = sm.OLS(target_np.squeeze(), sm.add_constant(data_np), hasconst=True)\n",
    "res = mod.fit()\n",
    "print(res.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Once again, we can see that all results are pretty much the same!!**\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Well Done!\n",
    "\n",
    "And voilà! We were able to train an OLS Regression model on distributed data and without ever seeing it. We were even able to compute standard errors and p-values for each coefficient.\n",
    "\n",
    "Also, after comparing our results with results given by other known libraries, we were able to validate this approach."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Congratulations!!! - Time to Join the Community!\n",
    "\n",
    "Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!\n",
    "\n",
    "\n",
    "### Star PySyft on GitHub\n",
    "\n",
    "The easiest way to help our community is just by starring the repositories! This helps raise awareness of the cool tools we're building.\n",
    "\n",
    "- [Star PySyft](https://github.com/OpenMined/PySyft)\n",
    "\n",
    "### Pick our tutorials on GitHub!\n",
    "\n",
    "We made really nice tutorials to get a better understanding of what Federated and Privacy-Preserving Learning should look like and how we are building the bricks for this to happen.\n",
    "\n",
    "- [Checkout the PySyft tutorials](https://github.com/OpenMined/PySyft/tree/master/examples/tutorials)\n",
    "\n",
    "\n",
    "### Join our Slack!\n",
    "\n",
    "The best way to keep up to date on the latest advancements is to join our community! \n",
    "\n",
    "- [Join slack.openmined.org](http://slack.openmined.org)\n",
    "\n",
    "### Join a Code Project!\n",
    "\n",
    "The best way to contribute to our community is to become a code contributor! If you want to start \"one off\" mini-projects, you can go to PySyft GitHub Issues page and search for issues marked `Good First Issue`.\n",
    "\n",
    "- [Good First Issue Tickets](https://github.com/OpenMined/PySyft/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22)\n",
    "\n",
    "### Donate\n",
    "\n",
    "If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!\n",
    "\n",
    "- [Donate through OpenMined's Open Collective Page](https://opencollective.com/openmined)"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
